feat(inspect): add ArrowRowBuilder for materializing Arrow batches by WZhuo · Pull Request #780 · apache/iceberg-cpp

WZhuo · 2026-06-25T03:19:03Z

Summary

Adds ArrowRowBuilder (arrow_row_builder_internal.h / arrow_row_builder.cc), a schema-driven RAII helper that materializes in-memory rows into an Arrow ArrowArray (a struct batch) for an arbitrary Iceberg schema. It wraps the nanoarrow boilerplate and exposes per-column access plus typed append free functions, so metadata tables (snapshots, history, manifests, …) can emit rows without re-implementing it.

This is the first of a series splitting metadata-table support into focused PRs; the InMemoryBatchReader and the SnapshotsTable::Scan integration are intended to follow in separate PRs that build on this.

What's included

ArrowRowBuilder — a single RAII class (move-only) with Make(const Schema&) and Make(const ArrowSchema*) overloads. Handles the full nanoarrow lifecycle: InitFromSchema → StartAppending → … append values … → FinishBuilding → Release. The ArrowArray is guarded immediately after InitFromSchema so a failure in StartAppending releases it automatically.
ArrowArrayGuard::Release() — added to the existing guard so other call sites (position_delete_writer, manifest_adapter) can reuse the RAII-release pattern instead of manually managing nanoarrow resources.
Free functions in the iceberg namespace: AppendNull, AppendBoolean, AppendInt (covers int32/int64/timestamp via nanoarrow's int64), AppendString, AppendStringMap.
The implementation lives at the core iceberg library level — it only needs nanoarrow + ToArrowSchema (no Apache Arrow), matching peers like manifest_adapter and arrow_c_data_util.
Unit tests in arrow_row_builder_test.cc covering typed appends (int32/string/int64/boolean/map), null handling for optional columns, multi-entry/empty string maps, zero-row batches, and column-index bounds. Compiled into the iceberg-data-test test target.

Testing

CMake (Ninja): cmake --build build --target iceberg-data-test then ran the test binary — 5/5 ArrowRowBuilderTest tests pass. ctest green.
The test verifies output by importing the produced C-data into Apache Arrow (arrow::ImportRecordBatch), so its target is under USE_BUNDLE.

Notes

The test is registered under CMake's bundle build only. The meson build (which has no Apache Arrow/bundle layer) is left unchanged; the core-only test target continues to build there.
Developed with AI-assisted tooling, reviewed by the author.

Add ArrowRowBuilder (inspect/row_builder_internal) to materialize in-memory rows into an ArrowArray for an arbitrary Iceberg schema, with typed append helpers (AppendNull/Boolean/Int/String/StringMap) reused by later metadata tables.

wgtmac · 2026-06-25T15:29:16Z

I think this is a good point to factor out a small internal nanoarrow builder/helper layer instead of adding another local copy of the same lifecycle code.

We already have the same pattern in a few production paths: initialize from schema, start appending, append child values, finish each row/list/map element, then finish building. For example, position_delete_writer.cc and arrow_c_data_util.cc both guard the initialized ArrowArray immediately after ArrowArrayInitFromSchema, while this new builder currently misses that and can leak if ArrowArrayStartAppending fails. manifest_adapter.cc also has local primitive/list/map append helpers, including map-shape validation, and this PR adds another set of primitive/string-map helpers.

A concrete way to keep this scoped would be:

Add an internal move-only nanoarrow array builder/owner that handles init/start/finish/release/ownership transfer.
Move the simple append helpers (null, signed/unsigned int, double, string, bytes, and maybe map entry validation) behind that internal helper.
Keep higher-level code like ManifestAdapter, batch projection, and metadata-table row materialization separate, since those have different domain semantics.

That should fix the leak path here and avoid repeating this fragile nanoarrow lifecycle in the next metadata-table PRs without turning this into a broad refactor.

- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix) - Move files from inspect/ to core level; rename to arrow_row_builder* - Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class - Add Make(const ArrowSchema*) overload for low-level callers - Add ArrowArrayGuard::Release() and guard InitFromSchema in Make() - Move test to arrow_row_builder_test.cc in data_test target

WZhuo · 2026-06-29T07:52:02Z

Thanks for the thorough review. All three points addressed:

ArrowRowBuilder is now a single RAII class (merged the two classes), moved to core level, with Make(const Schema&) and Make(const ArrowSchema*) overloads so both metadata tables and lower-level code like position_delete_writer/manifest_adapter can use it.
Append* helpers are free functions alongside ArrowRowBuilder: null, int (int32/int64/timestamp), boolean, string, and string_map.
The lifecycle is properly guarded: ArrowArrayGuard right after InitFromSchema, Release() after StartAppending succeeds. Also added Release() to ArrowArrayGuard for other call sites to reuse this pattern.

- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix) - Move files from inspect/ to core level; rename to arrow_row_builder* - Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class - Add Make(const ArrowSchema*) overload for low-level callers - Add ArrowArrayGuard::Release() and guard InitFromSchema in Make() - Move test to arrow_row_builder_test.cc in data_test target

WZhuo force-pushed the arrow-row-builder branch from b0a8615 to b42f0da Compare June 25, 2026 03:38

wgtmac reviewed Jun 25, 2026

View reviewed changes

Comment thread src/iceberg/CMakeLists.txt Outdated

wgtmac reviewed Jun 25, 2026

View reviewed changes

Comment thread src/iceberg/inspect/row_builder_internal.h Outdated

wgtmac reviewed Jun 25, 2026

View reviewed changes

Comment thread src/iceberg/inspect/row_builder_internal.cc Outdated

WZhuo force-pushed the arrow-row-builder branch from b82b9ba to f592f6d Compare June 29, 2026 07:54

WZhuo force-pushed the arrow-row-builder branch from f592f6d to d221034 Compare June 29, 2026 07:58

WZhuo force-pushed the arrow-row-builder branch from d221034 to 1212581 Compare June 29, 2026 08:29

WZhuo force-pushed the arrow-row-builder branch from 1212581 to 030c221 Compare June 29, 2026 11:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780

feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780
WZhuo wants to merge 2 commits into
apache:mainfrom
WZhuo:arrow-row-builder

WZhuo commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wgtmac commented Jun 25, 2026

Uh oh!

WZhuo commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

WZhuo commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included

Testing

Notes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wgtmac commented Jun 25, 2026

Uh oh!

WZhuo commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WZhuo commented Jun 25, 2026 •

edited

Loading